Re-evaluation of publicly available gene-expression databases using machine-learning yields a maximum prognostic power in breast cancer

Gene expression signatures refer to patterns of gene activities and are used to classify different types of cancer, determine prognosis, and guide treatment decisions. Advancements in high-throughput technology and machine learning have led to improvements to predict a patient’s prognosis for different cancer phenotypes. However, computational methods for analyzing signatures have not been used to evaluate their prognostic power. Contention remains on the utility of gene expression signatures for prognosis. The prevalent approaches include random signatures, expert knowledge, and machine learning to construct an improved signature. We unify these approaches to evaluate their prognostic power. Re-evaluation of publicly available gene-expression data from 8 databases with 9 machine-learning models revealed previously unreported results. Gene-expression signatures are confirmed to be useful in predicting a patient’s prognosis. Convergent evidence from \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\approx$$\end{document}≈ 10,000 signatures implicates a maximum prognostic power. By calculating the concordance index, which measures how well patients with different prognoses can be discriminated, we show that a signature can correctly discriminate patients’ prognoses no more than 80% of the time. Additionally, we show that more than 50% of the potentially available information is still missing at this value. We surmise that an accurate prognosis must incorporate molecular, clinical, histological, and other complementary factors.

in the process of training.Below we provide a short description of the machine learning models used in our study.The Cox proportional hazards model -while not a machine learning model -is used as a baseline.The Machine Learning in R package (mlr) 53 was employed to benchmark models and perform cross-validation.All results are based on 5 repeats of 5-fold cross-validation.All results below correspond to the tuned number of features: We tuned the number of features during each cross-validation fold so that a tuned (optimal) number of features was used for the eventual prediction.Other hyper-parameters are listed in Table 3.

Cox proportional hazards model
The Cox proportional hazards model can be regarded as the standard model for analyzing survival data 54 .In this model, the effect of variables -also called covariates x 1 , x 2 , ..., x n -on the time to an event of interest is evaluated.For example, an event might be the death of the patient or a relapse of the disease.Formally, the Cox model is expressed by the following hazard function: where β 1 , β 2 , ..., β n of n patients denote the regression coefficients, i.e. weights of the covariates: the larger the coefficient the larger effect its covariate has on the prognosis of survival times.They are estimated by maximizing the partial likelihood.The baseline hazard function h(t) remains unspecified since it is divided out by computing the proportional hazard.
The Cox model remains a highly robust model if applied to linearly independent data and under the assumption that the proportional hazard does not change over time.However, this model loses its robustness when applied to high-dimensional data.

Lasso, Ridge, and Elastic-Net Regressions
Since the Cox model generalizes poorly to high dimensional data, some penalizing constraints are often used in the process of maximizing the partial likelihood.As a consequence, the regression coefficients shrink toward zero, their variances reduce as well, and the less important covariates tend to have less effect in the model.
L1 and L2 regularizations are two standard forms of regularization: where λ is the regularization constant.The L1 regularization is also known as LASSO regression and produces models with a smaller set of coefficients since several coefficients are completely reduced to zero.Thus, variable selection is also performed during the fitting process.
The L2 regularization is also known as ridge regression and shrinks all coefficients by the same factor.As a result, all coefficients are reduced but none is eliminated.
The L1 regularization cannot select more variables than the number of samples.Moreover, it is biased toward the selection of groups of correlated variables 55 .
To overcome these limitations, one can use a linear combination of L1 and L2 penalties, which is then called the elastic net regression.The elastic net regression is especially useful when the number of variables is larger than the number of samples 55 .In our study, we evaluated the extensions of the lasso, ridge, and elastic net regressions to the Cox model 56,57 .

Boosted models
Boosting is an ensemble learning technique that combines the so-called weak i.e. base learners into stronger learners that are trained sequentially 58 .During each iteration, a new model is added to the ensemble correcting the errors of the previous model.Boosting has been adapted to survival analysis 59,60 .
We used the gradient boosting in our study 61 .This type of boosting trains on the residual errors (gradients) of the entire ensemble model at each step.It can also be trained with linear models as base learners as well as with decision trees as base learners.In this study, both methods were assessed.

Survival Trees and Random Survival Forests
Survival trees 62 and random survival forests 63 are an extension of decision trees 64,65 and the random forests algorithm developed by Leo Breiman 66 to censored survival data.Decision trees and random forests, in turn, are nonparametric regression and classification methods that are well suited for the case, where the number of variables is greater than the number of samples, for example in genetics.

Survival Trees
In the general description of decision trees, the space spanned by predictor variables, i.e. by the covariates, is recursively partitioned into several groups such that observations with similar responses are grouped together.In the case of numeric variables such as gene expression values, binary splits are conducted.For the splitting a variable and selecting a splitting threshold, decision trees follow the principle of impurity reduction.Following this principle, each split in the tree results in daughter nodes whose impurity is reduced in comparison to the parent nodes.The impurity can be measured with the Shannon entropy or the Gini index or other statistics.Finally, in an ensemble of trees predictions are made by means of averaging and combining the results of each decision tree.

Random Forests
Random survival forests aggregate the results from ensembles of decision trees, whereas each tree is generated from a bootstrap sample of the data.At each node, a random subset of predictor variables is sampled and one variable is selected to split on.The selected variable maximizes the difference in survival between daughter nodes.Mathematically, the log-rank statistic over all available split points and variables is maximized.For prediction, an average over the predictions of the single trees is used (a vote is used for a classification problem).
Both survival trees and random survival are able to robustly handle high-dimensional non-linear data and detect interactions among them.Provided the depth of trees is chosen carefully, they also reduce the tendency of overfitting the data.However, both algorithms are biased towards selecting more heterogeneous variables, i.e. variables with many possible split points are preferred.To overcome this problem, one can use the conditional inference forests 67 that select the split points based on linear rank statistics.Nonetheless, to detect non-linear effects in the predictor variable space, selecting the split points using maximally selected rank statistics can be conducted 68 .
Alongside survival trees, we evaluated both the standard random survival forests and the maximally selected rank statistics random survival forests.
The full names of the prognostic models, respective hyper-parameters as well as packages and functions used in this study are shown in Table 3.

Supplementary Gene Selection Methods
During variable selection, a subset of for the survival outcome relevant variables is selected.We applied 5 different gene selection methods and measured the performance of the survival models described above.
In a univariate model, a univariable Cox proportional hazards model is fitted to the expression values of each gene and the genes are ranked by the resulting C-index of the corresponding model.The variable importance of the random forests algorithm is computed by permuting the column containing the expression values of each gene and calculating the difference between the performances of the survival model before and after permutation.Subsequently, the genes are ranked based on these differences.
The survival random forests variable hunting method, on the other hand, use a different importance score.First, the standard variable importance is conducted on the entire dataset.Second, a random subset of genes is selected with probability proportional to the calculated variable importance, and a forest is fitted.Third, the selected genes are ordered by the shortest distance from the tree root to the largest subtree including this gene as its root; they are added successively to the fitting model until the joint importance does not increase anymore.These steps are iterated a specified number of times.Eventually, the variable importances result from the ranking of the variables based on the frequency of occurrence in these iterations.
The Minimum Redundancy Maximum Relevance algorithm 69 selects variables that are mutually far away from each other: variables that are mutually close to each other might be redundant.Thus, the algorithm minimizes redundancy by removing the potentially redundant variables.At the same time, the selected variables are highly correlated with the response variable such as survival time, meaning that they exhibit maximum relevance.
The Conditional Variable Importance for Random Forests utilizes the linear rank statistics of conditional random forests described above.
The full names of the selection methods, respective hyper-parameters as well as packages and functions used in this study are shown in Table 4.

Supplementary Random Signature Size
According to the rule of thumb that Cox proportional hazards models should be used with a minimum of 10 events per predictor variable (EPV), we should use 2 to 16 random genes in our datasets, since TCGA and METABRIC contain the smallest and largest numbers of 17 and 164 events, respectively.However, this rule is based on two simulation studies and may be relaxed 70 .Moreover, a study investigating this rule of thumb in 2 million anonymized patient records suggested that sample size for developing prognostic models is not simply related to EPV and that EPV should be dataset dependent.For these reasons, we randomly sampled gene sets containing a different number of genes ranging from 1-101 for each dataset.The sampling was repeated 100 times for each number of genes, Cox models were fitted, and the median C-index was calculated.As can be seen in the top Figure 1, the prognostic power is indeed dataset dependent.For all datasets except METABRIC, we could investigate a limited number of genes, since the Cox model does not converge with a smaller number of events.Nonetheless, we see that prognostic power reaches a plateau if a sufficient number of events is considered (METABRIC).For the rest datasets, the optimal number of genes seems to be in the range of 15-25 genes.In the bottom Figure 1, this range can be inspected more closely.As can be seen, the distribution of the median C-indices increases in the range from 1-15 genes in a gene set, after which it seems to fluctuate around a constant value.Chou et.al. have shown that the optimal number of genes in a signature lies around 20 and that with a larger number of genes, a model tends to overfit data (Figure 4 in 71 ).Moreover, most clinically relevant gene-expression signatures tend to contain a smaller number of genes varying from 2-50 (Supplementary Reported Gene Selections).Thus, we chose to sample 20 random genes in all datasets.

Supplementary Random Signature Superiority
In order to examine whether Random Signature Superiority (RSS) is present in this study, we we calculated the number of random signatures performing above the C-index of the reported 26-gene signature 41 (which has roughly the same size as random signatures, see Supplementary Reported Gene Selections) for each prognostic model and averaged this value over all datasets.The results are shown in Figure 5.We found (Supplementary Random Signature Superiority) that more than 60% of random signatures outperform the aforementioned reported signature in 4 of 8 datasets, exactly 49% in one dataset, and less than 22% in the remaining 3 datasets.Averaging across datasets, 44% of random signatures outperform the aforementioned reported signature.

Supplementary Comparison Random And Reported Signatures
In order to inspect the differences in prognostic power between random and reported selection methods, we plot the distributions in form of the violin plots for each model and each dataset in Figure 2. The distributions are compared using the Wilcoxon rank sum test.The significance levels are shown below the violin plots.As can be inspected here, reported signatures tend to have higher C-indices than random signatures, although the level of statistical significance varies across models and datasets.In order to compare the measures, we specifically utilized random signatures in the METABRIC dataset.

Figure 1 .
Figure 1.Relationship of median C-index and the number of genes in a random gene set.The median C-index was calculated based on 100 genes set sampled at random for each data point.The Cox proportional hazards model was used as a prognostic model.(top) Median C-indices computed based on the individual dataset.(bottom) Distribution of C-indices for all datasets in the range from 1-20 genes in a gene set.

Figure 2 .
Figure 2. Comparison of C-indices for random and reported selection methods.

Figure 3 .
Figure 3. Median of the sample medians in relation to the number of subjects.

Figure 4 ./ 19 Figure 5 .
Figure 4. Median of the sample medians in relation to the event rate.

Figure 6 . 19 Figure 8 .
Figure 6.Median absolute deviation in relation to the event rate.

Figure 10 .
Figure 10.Comparison of the C-index with the Area Under the Curve, Pearson and Spearman correlation coefficients.In order to compare the measures, we specifically utilized random signatures in the METABRIC dataset.

Table 1 .
Descriptive statistics of common important clinical parameters for all datasets used in this study.

Table 2 .
Gene lists from published signatures considered in the analysis.

Table 3 .
Machine learning prognostic models and respective hyper-parameters used in this study.

Table 4 .
Machine learning selection methods and respective hyper-parameters used in this study.

Table 5 .
Evaluation of the Random Signature Superiority.